10 research outputs found
GEC-DePenD: Non-Autoregressive Grammatical Error Correction with Decoupled Permutation and Decoding
Grammatical error correction (GEC) is an important NLP task that is currently
usually solved with autoregressive sequence-to-sequence models. However,
approaches of this class are inherently slow due to one-by-one token
generation, so non-autoregressive alternatives are needed. In this work, we
propose a novel non-autoregressive approach to GEC that decouples the
architecture into a permutation network that outputs a self-attention weight
matrix that can be used in beam search to find the best permutation of input
tokens (with auxiliary {ins} tokens) and a decoder network based on a
step-unrolled denoising autoencoder that fills in specific tokens. This allows
us to find the token permutation after only one forward pass of the permutation
network, avoiding autoregressive constructions. We show that the resulting
network improves over previously known non-autoregressive methods for GEC and
reaches the level of autoregressive methods that do not use language-specific
synthetic data generation methods. Our results are supported by a comprehensive
experimental validation on the CoNLL-2014 and Write&Improve+LOCNESS datasets
and an extensive ablation study that supports our architectural and algorithmic
choices.
Comment: ACL 202
Revisiting Mahalanobis Distance for Transformer-Based Out-of-Domain Detection
Real-life applications that rely heavily on machine learning, such as dialog
systems, demand out-of-domain detection methods. Intent classification models
should be equipped with a mechanism to distinguish seen intents from unseen
ones so that the dialog agent is capable of rejecting the latter and avoiding
undesired behavior. However, despite increasing attention paid to the task, the
best practices for out-of-domain intent detection have not yet been fully
established.
This paper conducts a thorough comparison of out-of-domain intent detection
methods. We prioritize methods that do not require access to out-of-domain data
during training, since gathering such data is extremely time- and
labor-consuming due to the lexical and stylistic variation of user utterances.
We evaluate multiple contextual encoders and methods that have proven to be
efficient on three standard datasets for intent classification, expanded with
out-of-domain utterances. Our main findings show that fine-tuning
Transformer-based encoders on in-domain data leads to superior results.
Mahalanobis distance, combined with utterance representations derived from
Transformer-based encoders, outperforms other methods by a wide margin and
establishes new state-of-the-art results for all datasets.
The broader analysis shows that the reason for success lies in the fact that
the fine-tuned Transformer is capable of constructing homogeneous
representations of in-domain utterances, revealing geometrical disparity to
out-of-domain utterances. In turn, the Mahalanobis distance captures this
disparity easily.
Comment: to appear in AAAI 202
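The scoring idea in the abstract above can be illustrated with a minimal sketch. It assumes hypothetical 2-D vectors in place of real Transformer utterance embeddings, a covariance matrix shared across intents, and illustrative helper names (`shared_covariance`, `solve`, `ood_score`); it is not the authors' implementation.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def mean(rows: List[Vector]) -> List[float]:
    """Coordinate-wise mean of a list of vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def shared_covariance(groups: Dict[str, List[Vector]], ridge: float = 1e-6) -> List[List[float]]:
    """Covariance of class-centered vectors pooled over all intents (assumption: one shared matrix)."""
    centered = []
    for rows in groups.values():
        mu = mean(rows)
        centered += [[x - m for x, m in zip(r, mu)] for r in rows]
    d, n = len(centered[0]), len(centered)
    cov = [[sum(r[i] * r[j] for r in centered) / n for j in range(d)] for i in range(d)]
    for i in range(d):
        cov[i][i] += ridge  # small ridge keeps the matrix invertible
    return cov


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def ood_score(x: Vector, class_means: Dict[str, List[float]], cov: List[List[float]]) -> float:
    """Smallest squared Mahalanobis distance from x to any in-domain intent mean."""
    best = float("inf")
    for mu in class_means.values():
        diff = [a - b for a, b in zip(x, mu)]
        best = min(best, sum(d * s for d, s in zip(diff, solve(cov, diff))))
    return best


# Toy "embeddings" for two in-domain intents (hypothetical data).
train = {
    "weather": [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    "music": [[5.0, 5.0], [6.0, 5.0], [5.0, 6.0], [6.0, 6.0]],
}
means = {name: mean(rows) for name, rows in train.items()}
cov = shared_covariance(train)

in_domain_score = ood_score([0.5, 0.6], means, cov)
out_of_domain_score = ood_score([3.0, -3.0], means, cov)
```

An utterance would be rejected as out-of-domain when its score exceeds a threshold chosen on held-out in-domain data; the threshold itself is not specified in the abstract.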
Sinkhorn Transformations for Single-Query Postprocessing in Text-Video Retrieval
A recent trend in multimodal retrieval is related to postprocessing test set
results via the dual-softmax loss (DSL). While this approach can bring
significant improvements, it usually presumes that an entire matrix of test
samples is available as DSL input. This work introduces a new postprocessing
approach based on Sinkhorn transformations that outperforms DSL. Further, we
propose a new postprocessing setting that does not require access to multiple
test queries. We show that our approach can significantly improve the results
of state-of-the-art models such as CLIP4Clip, BLIP, X-CLIP, and DRL, thus
achieving a new state-of-the-art on several standard text-video retrieval
datasets both with access to the entire test set and in the single-query
setting.
Comment: SIGIR 202
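A minimal sketch of Sinkhorn-style postprocessing in the full-matrix setting, assuming a toy 2x2 text-to-video similarity matrix. The function name `sinkhorn`, the temperature, and the iteration count are illustrative assumptions rather than the paper's configuration, and the single-query variant is not shown.

```python
import math
from typing import List

Matrix = List[List[float]]


def sinkhorn(sim: Matrix, temperature: float = 0.1, n_iters: int = 50) -> Matrix:
    """Alternately normalize columns and rows of exp(sim / temperature)."""
    k = [[math.exp(s / temperature) for s in row] for row in sim]
    for _ in range(n_iters):
        # Column step: each video distributes a total mass of 1 over the queries.
        col_sums = [sum(col) for col in zip(*k)]
        k = [[v / c for v, c in zip(row, col_sums)] for row in k]
        # Row step: each query distributes a total mass of 1 over the videos.
        row_sums = [sum(row) for row in k]
        k = [[v / r for v in row] for row, r in zip(k, row_sums)]
    return k


def argmax_rows(m: Matrix) -> List[int]:
    """Index of the best-scoring video for every query."""
    return [max(range(len(row)), key=row.__getitem__) for row in m]


# Rows are text queries, columns are videos (hypothetical similarities).
similarities = [
    [0.9, 0.8],
    [0.7, 0.3],
]
raw_ranking = argmax_rows(similarities)
sinkhorn_ranking = argmax_rows(sinkhorn(similarities))
```

On this toy matrix the raw scores send both queries to video 0, whereas the normalized scores assign video 0 to the second query (which prefers it by a much larger margin) and video 1 to the first.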
Artificial Text Boundary Detection with Topological Data Analysis and Sliding Window Techniques
Due to the rapid development of text generation models, people increasingly
often encounter texts that may start out as written by a human but then
continue as machine-generated results of large language models. Detecting the
boundary between human-written and machine-generated parts of such texts is a
very challenging problem that has not received much attention in the
literature. In this work, we consider a number of different approaches to this
artificial text boundary detection problem, comparing several predictors over
features of different nature. We show that supervised fine-tuning of the
RoBERTa model works well for this task in general but fails to generalize in
important cross-domain and cross-generator settings, demonstrating a tendency
to overfit to spurious properties of the data. Then, we propose novel
approaches based on features extracted from a frozen language model's
embeddings that are able to outperform both the human accuracy level and
previously considered baselines on the Real or Fake Text benchmark. Moreover,
we adapt perplexity-based approaches for the boundary detection task and
analyze their behaviour. We analyze the robustness of all proposed classifiers
in cross-domain and cross-model settings, discovering important properties of
the data that can negatively influence the performance of artificial text
boundary detection algorithms.
Intrinsic Dimension Estimation for Robust Detection of AI-Generated Texts
Rapidly increasing quality of AI-generated content makes it difficult to
distinguish between human and AI-generated texts, which may lead to undesirable
consequences for society. Therefore, it becomes increasingly important to study
the properties of human texts that are invariant over text domains and varying
proficiency of human writers, can be easily calculated for any language, and
can robustly separate natural and AI-generated texts regardless of the
generation model and sampling method. In this work, we propose such an
invariant of human texts, namely the intrinsic dimensionality of the manifold
underlying the set of embeddings of a given text sample. We show that the
average intrinsic dimensionality of fluent texts in natural language is
hovering around the value for several alphabet-based languages and around
for Chinese, while the average intrinsic dimensionality of AI-generated
texts for each language is lower, with a clear statistical
separation between human-generated and AI-generated distributions. This
property allows us to build a score-based artificial text detector. The
proposed detector's accuracy is stable over text domains, generator models, and
human writer proficiency levels, outperforming SOTA detectors in model-agnostic
and cross-domain scenarios by a significant margin.
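To make the notion of intrinsic dimensionality concrete, here is a minimal sketch. The abstract does not name the estimator, so this example swaps in the Levina-Bickel maximum-likelihood estimator (with the k-2 bias correction) as a stand-in, and uses synthetic 3-D point clouds in place of token embeddings; all names and parameters are assumptions.

```python
import math
import random
from typing import List, Sequence


def mle_intrinsic_dimension(points: List[Sequence[float]], k: int = 10) -> float:
    """Average of per-point maximum-likelihood dimension estimates from k nearest neighbors."""
    estimates = []
    for i, p in enumerate(points):
        # Distances to the k nearest neighbors, excluding the point itself.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)[:k]
        log_ratios = [math.log(dists[-1] / d) for d in dists[:-1]]
        estimates.append((k - 2) / sum(log_ratios))
    return sum(estimates) / len(estimates)


rng = random.Random(0)

# A 1-D line and a 2-D plane, both embedded in 3-D ambient space.
line = [(t, 2.0 * t, -t) for t in (rng.random() for _ in range(200))]
plane = [(u, v, u + v) for u, v in ((rng.random(), rng.random()) for _ in range(200))]

line_dim = mle_intrinsic_dimension(line)
plane_dim = mle_intrinsic_dimension(plane)
```

Both clouds live in the same 3-D ambient space, but the estimate for the line should come out near 1 and the estimate for the plane near 2. A score-based detector of the kind described above would compare such an estimate for a text's embeddings against a threshold.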
Acceptability Judgements via Examining the Topology of Attention Maps
The role of the attention mechanism in encoding linguistic knowledge has
received special interest in NLP. However, the ability of the attention heads
to judge the grammatical acceptability of a sentence has been underexplored.
This paper approaches the paradigm of acceptability judgments with topological
data analysis (TDA), showing that the geometric properties of the attention
graph can be efficiently exploited for two standard practices in linguistics:
binary judgments and linguistic minimal pairs. Topological features enhance the
BERT-based acceptability classifier scores by %-% on CoLA in three
languages (English, Italian, and Swedish). By revealing the topological
discrepancy between attention maps of minimal pairs, we achieve human-level
performance on the BLiMP benchmark, outperforming nine statistical and
Transformer LM baselines. At the same time, TDA provides the foundation for
analyzing the linguistic functions of attention heads and interpreting the
correspondence between the graph features and grammatical phenomena.
Comment: Accepted to EMNLP 2022 Findings
Topological Data Analysis for Speech Processing
We apply topological data analysis (TDA) to speech classification problems
and to the introspection of a pretrained speech model, HuBERT. To this end, we
introduce a number of topological and algebraic features derived from
Transformer attention maps and embeddings. We show that a simple linear
classifier built on top of such features outperforms a fine-tuned
classification head. In particular, we achieve an improvement of about
accuracy and ERR on four common datasets; on CREMA-D, the proposed
feature set reaches a new state-of-the-art performance with accuracy .
We also show that topological features are able to reveal functional roles of
speech Transformer heads; e.g., we find heads capable of distinguishing
between pairs of sample sources (natural/synthetic) or voices without any
downstream fine-tuning. Our results demonstrate that TDA is a promising new
approach for speech analysis, especially for tasks that require structural
prediction. Appendices, an introduction to TDA, and other additional materials
are available here: https://topohubert.github.io/speech-topology-webpages/
Comment: Accepted to INTERSPEECH 2023 conference
Betti numbers of attention graphs is all you really need
We apply methods of topological analysis to the attention graphs, calculated
on the attention heads of the BERT model (arXiv:1810.04805v2). Our research
shows that the classifier built upon basic persistent topological features
(namely, Betti numbers) of the trained neural network can achieve
classification results on par with the conventional classification method. We
show the relevance of such topological text representation on three text
classification benchmarks. To the best of our knowledge, this is the first
attempt to analyze the topology of an attention-based neural network, widely
used for Natural Language Processing.
Comment: This short paper was submitted to the "Topological Data Analysis and
Beyond" Workshop at NeurIPS 2020 in July 2020, but wasn't accepted. Later the
ideas from this short paper found a rich development in arXiv:2109.04825 and
arXiv:2205.0963
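A minimal sketch of how Betti numbers might be computed for one attention graph, assuming the graph is built by keeping an undirected edge between two tokens whenever the attention weight in either direction reaches a threshold. The function name `betti_numbers`, the toy matrix, and the thresholding rule are assumptions for illustration, not the paper's exact construction.

```python
from typing import List, Tuple


def betti_numbers(attention: List[List[float]], threshold: float) -> Tuple[int, int]:
    """Return (b0, b1) of the thresholded, undirected attention graph.

    b0 is the number of connected components; for a graph, the number of
    independent cycles is b1 = |E| - |V| + b0.
    """
    n = len(attention)
    parent = list(range(n))

    def find(i: int) -> int:
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = 0
    for i in range(n):
        for j in range(i + 1, n):
            if max(attention[i][j], attention[j][i]) >= threshold:
                edges += 1
                parent[find(i)] = find(j)

    b0 = len({find(i) for i in range(n)})
    b1 = edges - n + b0
    return b0, b1


# Hypothetical attention weights for a 4-token input (each row sums to 1).
attention = [
    [0.1, 0.6, 0.3, 0.0],
    [0.5, 0.1, 0.4, 0.0],
    [0.2, 0.7, 0.1, 0.0],
    [0.0, 0.1, 0.1, 0.8],
]

loose = betti_numbers(attention, threshold=0.25)
strict = betti_numbers(attention, threshold=0.5)
```

At the lower threshold the first three tokens form a triangle (one cycle) while the fourth stays isolated; at the higher threshold the triangle loses an edge and the cycle disappears. Collecting such counts over several thresholds and attention heads yields a feature vector that a simple classifier could consume.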
Artificial Text Detection via Examining the Topology of Attention Maps
https://aclanthology.org/2021.emnlp-main.50/
The impressive capabilities of recent generative models to create texts that are challenging to distinguish from human-written ones can be misused for generating fake news, product reviews, and even abusive content. Despite the prominent performance of existing methods for artificial text detection, they still lack interpretability and robustness towards unseen models. To this end, we propose three novel types of interpretable topological features for this task based on Topological Data Analysis (TDA), which is currently understudied in the field of NLP. We empirically show that the features derived from the BERT model outperform count- and neural-based baselines by up to 10% on three common datasets, and tend to be the most robust towards unseen GPT-style generation models as opposed to existing methods. The probing analysis of the features reveals their sensitivity to the surface and syntactic properties. The results demonstrate that TDA is a promising line with respect to NLP tasks, specifically the ones that incorporate surface and structural information.